Knowledge Base Creation, Enrichment and Repair

Authors

  • Sebastian Hellmann
  • Volha Bryl
  • Lorenz Bühmann
  • Milan Dojchinovski
  • Dimitris Kontokostas
  • Jens Lehmann
  • Uros Milosevic
  • Petar Petrovski
  • Vojtech Svátek
  • Mladen Stanojevic
  • Ondrej Sváb-Zamazal
Abstract

This chapter focuses on data transformation to RDF and Linked Data, and furthermore on the improvement of existing or extracted data, especially with respect to schema enrichment and ontology repair. Tasks concerning the triplification of data are mainly grounded on existing and well-proven techniques and were refined during the lifetime of the LOD2 project and integrated into the LOD2 Stack. Triplification of legacy data, i.e. data not yet in RDF, represents the entry point for legacy systems to participate in the LOD cloud. While existing systems are often very useful and successful, there are notable differences between the ways knowledge bases and Wikis or databases are created and used. One of the key differences in content is the importance and use of schematic information in knowledge bases. This information is usually absent in the source system and therefore also in many LOD knowledge bases. However, schema information is needed for consistency checking and finding modelling problems. We will present a combination of enrichment and repair steps to tackle this problem, based on previous research in machine learning and knowledge representation. Overall, the chapter describes how to enable tool-supported creation and publishing of RDF as Linked Data (Sect. 1) and how to increase the quality and value of such large knowledge bases when published on the Web (Sect. 2).

1 Linked Data Creation and Extraction

1.1 DBpedia, a Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia

Wikipedia is the 6th most popular website (http://www.alexa.com/topsites, retrieved in May 2014), the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. There are official Wikipedia editions in 287 different languages (http://meta.wikimedia.org/wiki/List_of_Wikipedias), which range in size from a couple of hundred articles up to 3.8 million articles (English edition). Besides free text, Wikipedia articles consist of different types of structured data such as infoboxes, tables, lists, and categorization data. Wikipedia currently offers only free-text search capabilities to its users. Using Wikipedia search, it is thus very difficult to find all rivers that flow into the Rhine and are longer than 100 km, or all Italian composers that were born in the 18th century.

Fig. 1. Overview of DBpedia extraction framework

The DBpedia project [9,13,14] builds a large-scale, multilingual knowledge base by extracting structured data from Wikipedia editions in 111 languages. Wikipedia editions are extracted by the open source "DBpedia extraction framework" (cf. Fig. 1). The largest DBpedia knowledge base, which is extracted from the English edition of Wikipedia, consists of over 400 million facts that describe 3.7 million things. The DBpedia knowledge bases that are extracted from the other 110 Wikipedia editions together consist of 1.46 billion facts and describe 10 million additional things. The extracted knowledge is encapsulated in modular dumps as depicted in Fig. 2. This knowledge base can be used to answer expressive queries such as the ones outlined above. Being multilingual and covering a wide range of topics, the DBpedia knowledge base is also useful within further application domains such as data integration, named entity recognition, topic detection, and document ranking.
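To make the example query above concrete, the following is a minimal sketch of how such a question can be asked against the public DBpedia SPARQL endpoint. It assumes the third-party Python SPARQLWrapper library and the DBpedia ontology properties dbo:riverMouth and dbo:length (lengths in metres); the exact property names can differ between DBpedia releases, so this is an illustration rather than part of the extraction framework itself.

```python
# Minimal sketch: query DBpedia for rivers that flow into the Rhine and are
# longer than 100 km. Assumes the third-party SPARQLWrapper library and the
# dbo:riverMouth / dbo:length properties (metres); names may vary by release.
from SPARQLWrapper import SPARQLWrapper, JSON

endpoint = SPARQLWrapper("http://dbpedia.org/sparql")
endpoint.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>
    SELECT ?river ?length WHERE {
        ?river a dbo:River ;
               dbo:riverMouth dbr:Rhine ;
               dbo:length ?length .
        FILTER (?length > 100000)   # 100 km, expressed in metres
    }
    ORDER BY DESC(?length)
""")
endpoint.setReturnFormat(JSON)

for row in endpoint.query().convert()["results"]["bindings"]:
    print(row["river"]["value"], row["length"]["value"])
```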
The DBpedia knowledge base is widely used as a test-bed in the research community, and numerous applications, algorithms and tools have been built around or applied to DBpedia. Due to the continuous growth of Wikipedia and improvements in DBpedia, the extracted data provides increasing added value for data acquisition, re-use and integration tasks within organisations. While the quality of extracted data is unlikely to reach the quality of completely manually curated data sources, it can be applied to some enterprise information integration use cases and has been shown to be relevant in several applications beyond research projects.

Fig. 2. Overview of the DBpedia data stack

DBpedia is served as Linked Data on the Web. Since it covers a wide variety of topics and sets RDF links pointing into various external data sources, many Linked Data publishers have decided to set RDF links pointing to DBpedia from their data sets. Thus, DBpedia has become a central interlinking hub in the Web of Linked Data and has been a key factor for the success of the Linked Open Data initiative.

The structure of the DBpedia knowledge base is maintained by the DBpedia user community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology unifies different template structures, both within single Wikipedia language editions and across currently 27 different languages. The maintenance of different language editions of DBpedia is spread across a number of organisations, each of which is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalisation Committee. The DBpedia Association provides an umbrella on top of all the DBpedia chapters and tries to support DBpedia and the DBpedia Contributors Community.

1.2 RDFa, Microdata and Microformats Extraction Framework

In order to enable web applications to understand the content of HTML pages, an increasing number of websites have started to semantically mark up their pages, that is, to embed structured data describing products, people, organizations, places, events, etc. into HTML pages using markup standards such as Microformats (http://microformats.org/), RDFa (http://www.w3.org/TR/xhtml-rdfa-primer/) and Microdata (http://www.w3.org/TR/microdata/). Microformats use style definitions to annotate HTML text with terms from a fixed set of vocabularies, RDFa allows embedding any kind of RDF data into HTML pages, and Microdata is part of the HTML5 standardization effort, allowing the use of arbitrary vocabularies for structured data. The embedded data is crawled together with the HTML pages by search engines such as Google, Yahoo! and Bing, which use these data to enrich their search results. Up to now, only these companies were capable of providing insights [15] into the amount as well as the types of data that are published on the web using different markup standards, as they were the only ones possessing large-scale web crawls. However, the situation changed with the advent of the Common Crawl (http://commoncrawl.org/), a non-profit foundation that crawls the web and regularly publishes the resulting corpora for public usage on Amazon S3. For the purpose of extracting structured data from these large-scale web corpora we have developed the RDFa, Microdata and Microformats extraction framework, which is available online (https://subversion.assembla.com/svn/commondata/). The extraction consists of the following steps.
Firstly, a file with the crawled data, in the form of an ARC or WARC archive, is downloaded from the storage. The archives usually contain up to several thousand archived web pages. The framework relies on the Anything To Triples (Any23) parser library (https://any23.apache.org/) for extracting RDFa, Microdata, and Microformats from HTML content. Any23 outputs RDF quads, consisting of subject, predicate, object, and a URL which identifies the HTML page from which the triple was extracted. Any23 parses web pages for structured data by building a DOM tree and then evaluating XPath expressions to extract the structured data. As we have found that the tree generation accounts for much of the parsing cost, we have introduced a filtering step: we run regular expressions against each archived HTML page prior to extraction to detect the presence of structured data, and only run the Any23 extractor when potential matches are found (a sketch of this filtering step is shown below, after Table 1). The output of the extraction process is in NQ (RDF quads) format.

We have made available two implementations of the extraction framework, one based on the Amazon Web Services, and the second one being a Map/Reduce implementation that can be run over any Hadoop cluster. Additionally, we provide a plugin for the Apache Nutch crawler, allowing the user to configure the crawl and then extract structured data from the resulting page corpus.

To verify the framework, three large-scale RDFa, Microformats and Microdata extractions have been performed, corresponding to the Common Crawl data from 2009/2010, August 2012 and November 2013. The results of the 2012 and 2009/2010 extractions are published in [2] and [16], respectively. Table 1 presents a comparative summary of the three extracted datasets. The table reports the number and the percentage of URLs in each crawl containing structured data, and gives the percentage of these data represented using Microformats, RDFa and Microdata, respectively.

Table 1. Large-scale RDF datasets extracted from Common Crawl (CC): summary

                                    CC 2009/2010    CC August 2012   CC November 2013
Size (TB), compressed               28.9            40.1             44
Size, URLs                          2,565,741,671   3,005,629,093    2,224,829,946
Size, Domains                       19,113,929      40,600,000       12,831,509
Parsing cost, USD                   576             398              263
Structured data, URLs with triples  147,871,837     369,254,196      585,792,337
Structured data, in %               5.76            12.28            26.32
Microformats, in %                  96.99           70.98            47.48
RDFa, in %                          2.81            22.71            26.48
Microdata, in %                     0.2             6.31             26.04
Average num. of triples per URL     3.35            4.05             4.04

The numbers illustrate the trends very clearly: in recent years, the amount of structured data embedded into HTML pages keeps increasing. The use of Microformats is decreasing rapidly, while the use of RDFa and especially Microdata has increased a lot, which is not surprising as the adoption of the latter is strongly encouraged by the biggest search engines. On the other hand, the average number of triples per web page (only pages containing structured data are considered) stays the same across the different versions of the crawl, which means that the data completeness has not changed much.
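The following is a minimal sketch of the regex pre-filtering step described above, assuming the third-party Python warcio library; the actual framework is Java-based and hands matching pages to Any23, and the detection patterns shown here are illustrative stand-ins rather than the ones used in production.

```python
# Minimal sketch of the pre-filtering step: scan a WARC archive and report
# pages that likely contain structured data, so that only those are passed
# to a full extractor such as Any23. Assumes the third-party warcio library;
# the regular expressions are illustrative, not the production patterns.
import re
from warcio.archiveiterator import ArchiveIterator

# Cheap indicators for RDFa, Microdata and a few common Microformats.
MARKUP_PATTERNS = re.compile(
    rb'(property\s*=|typeof\s*=|vocab\s*=|'       # RDFa attributes
    rb'itemscope|itemtype\s*=|itemprop\s*=|'      # Microdata attributes
    rb'class\s*=\s*"[^"]*(hcard|hcalendar|hreview|vcard)[^"]*")',
    re.IGNORECASE,
)

def pages_with_structured_data(warc_path):
    """Yield (url, html_bytes) for archived pages that match the patterns."""
    # ArchiveIterator transparently handles both .warc and .warc.gz files.
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            html = record.content_stream().read()
            if MARKUP_PATTERNS.search(html):
                yield record.rec_headers.get_header("WARC-Target-URI"), html

if __name__ == "__main__":
    for url, html in pages_with_structured_data("example.warc.gz"):
        # In the real pipeline, matching pages would be handed to Any23 here.
        print(url)
```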
Concerning the topical domains of the published data, the dominant ones are: persons and organizations (for all three formats), blog- and CMS-related metadata (RDFa and Microdata), navigational metadata (RDFa and Microdata), product data (all three formats), and event data (Microformats). Additional topical domains with smaller adoption include job postings (Microdata) and recipes (Microformats). The data types, formats and vocabularies seem to be largely determined by the major consumers the data is targeted at. For instance, the RDFa portion of the corpora is dominated by the vocabulary promoted by Facebook, while the Microdata subset is dominated by the vocabularies promoted by Google, Yahoo! and Bing via schema.org.

More detailed statistics on the three corpora are available at the Web Data Commons page. By publishing the data extracted from RDFa, Microdata and Microformats annotations, we hope on the one hand to initiate further domain-specific studies by third parties. On the other hand, we hope to lay the foundation for enlarging the number of applications that consume structured data from the web.
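As an illustration of how such embedded data can be consumed by third-party applications, the following sketch parses schema.org Microdata from a single HTML page. It assumes the third-party Python extruct library, which is not part of the Any23-based framework described above, and the markup in the example is invented for illustration.

```python
# Illustrative sketch: consuming schema.org Microdata embedded in an HTML
# page. Assumes the third-party extruct library (not part of the Any23-based
# framework described above); the markup below is an invented example.
import json
import extruct

HTML = """
<html><body>
  <div itemscope itemtype="http://schema.org/Product">
    <span itemprop="name">Example Widget</span>
    <span itemprop="offers" itemscope itemtype="http://schema.org/Offer">
      <span itemprop="price">19.99</span>
      <span itemprop="priceCurrency">EUR</span>
    </span>
  </div>
</body></html>
"""

# extruct returns a dict keyed by syntax (microdata, rdfa, json-ld, ...).
data = extruct.extract(HTML, base_url="http://example.org/product")
print(json.dumps(data["microdata"], indent=2))
```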


